Document Clustering using Word Sense Disambiguation
Authors
Abstract
In computational linguistics, word sense disambiguation (WSD) is the problem of determining which sense of a word with several distinct senses is used in a given sentence. This paper addresses text document clustering, one of the major tasks of text processing. Document clustering is the process of finding groups of information in text documents and clustering these documents into the most relevant groups. Large document corpora suffer from ambiguity problems such as synonymy, polysemy, and other semantic relations. For this reason, we perform WSD on all terms in all documents to obtain the best sense of each term, and we use these senses as document features in the clustering process. Our experimental results show that the efficiency of document clustering using WSD increases linearly with the size of the document dataset. Different part-of-speech (POS) taggers were tested to determine which performs best, and the effect of different window sizes on the WSD task was compared.
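To make the idea of window-based WSD feeding sense-level document features more concrete, here is a minimal sketch. It is not the method used in the paper, whose algorithm the abstract does not specify. It assumes a simplified Lesk-style gloss overlap, and the sense inventory `SENSES`, the sense labels, and the function names `disambiguate` and `sense_features` are all hypothetical and chosen for illustration only.

```python
from collections import Counter

# Hypothetical toy sense inventory (an assumption): word -> {sense_id: gloss}.
# A real system would draw senses and glosses from a lexical resource.
SENSES = {
    "bank": {
        "bank.finance": "institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a river or stream of water",
    },
}


def disambiguate(tokens, index, window=3):
    """Return the sense whose gloss shares the most words with the context.

    The context is up to `window` tokens on each side of tokens[index].
    Words without an entry in SENSES are returned unchanged.
    Ties are resolved in favour of the first sense listed.
    """
    word = tokens[index]
    senses = SENSES.get(word)
    if not senses:
        return word
    start = max(0, index - window)
    context = set(tokens[start:index] + tokens[index + 1:index + 1 + window])

    def overlap(item):
        _, gloss = item
        return len(context & set(gloss.split()))

    return max(senses.items(), key=overlap)[0]


def sense_features(text, window=3):
    """Bag of senses: count disambiguated terms instead of surface words."""
    tokens = text.lower().split()
    return Counter(disambiguate(tokens, i, window) for i in range(len(tokens)))


if __name__ == "__main__":
    sentence = "the bank near the quiet river".split()
    # A narrow window misses the informative word "river".
    print(disambiguate(sentence, 1, window=1))
    # A wider window includes it and changes the chosen sense.
    print(disambiguate(sentence, 1, window=4))
    print(sense_features("they fished from the bank of the river"))
```

The resulting sense counts could then be passed to any clustering algorithm in place of plain term counts. The two calls on the same sentence illustrate why window size is worth comparing: the chosen sense depends on how much context is visible.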
Similar resources
Graph-based Word Clustering using a Web Search Engine
Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure by web counts. Each pair of words is queried to a search engine, which produces a co-occurrence matrix. By cal...
Investigating the Role of Homograph Context Types in Determining Similarity Between Documents
Aim: Automatic information retrieval is based on the assumption that texts contain content or structural elements that can be used in word sense disambiguation, thereby improving the effectiveness of the retrieved results. Homographs are among the words requiring sense disambiguation. Depending on their roles and positions in texts, homograph contexts could be divided into different types, wit...
An Efficient Technique to Improve Snippet Clustering
Document clustering is an effective tool for managing information overload. By grouping similar documents together, we enable a human observer to quickly browse large document collections and easily grasp the distinct topics and subtopics. In this paper we survey the most important problems and techniques related to text information retrieval: document pre-processing and filtering...
Improving Summarization of Biomedical Documents Using Word Sense Disambiguation
We describe a concept-based summarization system for biomedical documents and show that its performance can be improved using Word Sense Disambiguation. The system represents the documents as graphs formed from concepts and relations from the UMLS. A degree-based clustering algorithm is applied to these graphs to discover different themes or topics within the document. To create the graphs, the...
Meaningful Clusters
We present an approach to the disambiguation of cluster labels that capitalizes on the notion of semantic similarity to assign WordNet senses to cluster labels. The approach provides interesting insights on how document clustering can provide the basis for developing a novel approach to word sense disambiguation.